Skip to content

Feat/external eval server #23

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jul 18, 2025

Conversation

olesho
Copy link
Collaborator

@olesho olesho commented Jul 13, 2025

No description provided.

@tysonthomas9 tysonthomas9 requested a review from Copilot July 14, 2025 15:52
Copilot

This comment was marked as outdated.

@tysonthomas9 tysonthomas9 requested a review from Copilot July 17, 2025 23:20
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a comprehensive external evaluation server system for testing AI Chat agent tools in DevTools. The feature enables external evaluation of AI Chat tools via WebSocket RPC communication between DevTools and a standalone evaluation server.

  • External Evaluation Infrastructure: Complete WebSocket-based evaluation server with client management, authentication, and tool execution
  • DevTools Integration: New evaluation configuration UI in Settings Dialog with connection management and testing capabilities
  • Comprehensive Test Suite: 25+ evaluation test cases covering web task agents, schema extractors, and various real-world scenarios

Reviewed Changes

Copilot reviewed 114 out of 118 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
front_end/panels/ai_chat/ui/SettingsDialog.ts Adds evaluation configuration UI with connection status, endpoint settings, and test functionality
front_end/panels/ai_chat/evaluation/EvaluationAgent.ts Implements WebSocket client for receiving and executing evaluation requests from server
front_end/panels/ai_chat/evaluation/EvaluationProtocol.ts Defines message protocol for client-server evaluation communication
front_end/panels/ai_chat/common/EvaluationConfig.ts Manages evaluation configuration, connection state, and localStorage persistence
front_end/panels/ai_chat/common/WebSocketRPCClient.ts Generic WebSocket RPC client with reconnection and error handling
eval-server/src/server.js Main evaluation server with client management, authentication, and evaluation orchestration
eval-server/evals/web-task-agent/*.yaml Comprehensive test cases for web task agent covering social media, search, booking, finance, etc.
front_end/panels/ai_chat/**/ErrorHandlingUtils.ts Improves error logging consistency across multiple files
Files not reviewed (1)
  • eval-server/package-lock.json: Language not supported


// Set up periodic status updates every 2 seconds
const statusUpdateInterval = setInterval(updateConnectionStatus, 2000);

Copy link
Preview

Copilot AI Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The status update interval is created but never cleaned up when the dialog is closed or destroyed. This could lead to memory leaks and unnecessary function calls. Consider storing the interval ID and clearing it in a cleanup function.

Suggested change
// Ensure the interval is cleared when the dialog is closed
this.addEventListener('close', () => {
clearInterval(statusUpdateInterval);
});

Copilot uses AI. Check for mistakes.

const clientIdInput = document.createElement('input');
clientIdInput.type = 'text';
clientIdInput.className = 'settings-input';
clientIdInput.value = currentEvaluationConfig.clientId || 'Auto-generated on first connection';
Copy link
Preview

Copilot AI Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The fallback text 'Auto-generated on first connection' should be extracted to a constant or the UIStrings object for better maintainability and potential internationalization.

Suggested change
clientIdInput.value = currentEvaluationConfig.clientId || 'Auto-generated on first connection';
clientIdInput.value = currentEvaluationConfig.clientId || i18nString(UIStrings.autoGeneratedOnFirstConnection);

Copilot uses AI. Check for mistakes.

});
// For new clients, the server created the config and asks to reconnect
// We can attempt to reconnect after a short delay
setTimeout(() => {
Copy link
Preview

Copilot AI Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The hardcoded 1000ms delay for reconnection retry should be extracted to a constant or configuration parameter for better maintainability and testability.

Copilot uses AI. Check for mistakes.

timestamp: new Date().toISOString()
});
}
}, 30000); // Send ping every 30 seconds
Copy link
Preview

Copilot AI Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The hardcoded 30000ms (30 seconds) heartbeat interval should be extracted to a constant or configuration parameter for better maintainability.

Suggested change
}, 30000); // Send ping every 30 seconds
}, HEARTBEAT_INTERVAL_MS); // Send ping every 30 seconds

Copilot uses AI. Check for mistakes.

}

this.currentReconnectAttempt++;
const delay = this.reconnectDelay * Math.pow(2, this.currentReconnectAttempt - 1);
Copy link
Preview

Copilot AI Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exponential backoff calculation could result in very large delays for higher retry attempts. Consider adding a maximum delay cap to prevent excessively long wait times.

Suggested change
const delay = this.reconnectDelay * Math.pow(2, this.currentReconnectAttempt - 1);
const MAX_RECONNECT_DELAY = 30000; // Maximum delay cap in milliseconds (30 seconds)
const delay = Math.min(this.reconnectDelay * Math.pow(2, this.currentReconnectAttempt - 1), MAX_RECONNECT_DELAY);

Copilot uses AI. Check for mistakes.

// Auto-create new client configuration
try {
logger.info('Auto-creating new client configuration', { clientId });
await this.clientManager.createClientWithId(clientId, `DevTools Client ${clientId.substring(0, 8)}`, 'hello');
Copy link
Preview

Copilot AI Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hardcoded default secret key 'hello' poses a security risk. Consider generating a strong random secret key or requiring explicit secret key configuration for new clients.

Suggested change
await this.clientManager.createClientWithId(clientId, `DevTools Client ${clientId.substring(0, 8)}`, 'hello');
const randomSecretKey = crypto.randomBytes(32).toString('hex');
await this.clientManager.createClientWithId(clientId, `DevTools Client ${clientId.substring(0, 8)}`, randomSecretKey);

Copilot uses AI. Check for mistakes.

Comment on lines +137 to +138
// Don't automatically start evaluations - wait for manual trigger
// this.processClientEvaluations(connection.clientId);
Copy link
Preview

Copilot AI Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] This commented-out line and its explanation suggest incomplete functionality. Either implement the automatic evaluation feature or remove the commented code and explanation to avoid confusion.

Suggested change
// Don't automatically start evaluations - wait for manual trigger
// this.processClientEvaluations(connection.clientId);
// Evaluations must be manually triggered.

Copilot uses AI. Check for mistakes.

@tysonthomas9 tysonthomas9 merged commit 543b097 into tysonthomas9:main Jul 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants